Chromosomal-level genome assembly of the long-spined sea urchin Diadema setosum (Leske, 1778)

The long-spined sea urchin Diadema setosum is an algal and coral feeder widely distributed in the Indo-Pacific that can cause severe bioerosion of the reef community. However, the lack of genomic information has hindered the study of its ecology and evolution. Here, we report the chromosomal-level genome (885.8 Mb) of the long-spined sea urchin D. setosum using a combination of PacBio long-read sequencing and Omni-C scaffolding technology. The assembled genome has a scaffold N50 length of 38.3 Mb, contains 98.1% of complete BUSCO (Geno, metazoa_odb10) genes (the single-copy score is 97.8% and the duplication score is 0.3%), and has 98.6% of its sequences anchored to 22 pseudo-molecules/chromosomes. A total of 27,478 gene models were annotated, reaching a total of 28,414 transcripts, including 5,384 tRNA and 23,030 protein-coding genes. The high-quality genome of D. setosum presented here is a valuable resource for ecological and evolutionary studies of this coral reef-associated sea urchin.

Diadema setosum (Leske, 1778, NCBI:txid31175) in the order Diadematoida, commonly known as the porcupine or long-spined sea urchin, is considered one of the oldest known extant species in the genus Diadema [16]. D. setosum displays features of a typical sea urchin, including a dorso-ventrally compressed body equipped with particularly long, brittle, and hollow spines that are mildly venomous [17,18]. This species can be easily differentiated from other Diadema species by the presence of five distinctive white dots on the aboral side around the anal pore between the ambulacral grooves (Figure 1A). Sexually mature individuals have been documented to have an average weight of 35 to 80 g and an average test size of 7 to 8 cm in diameter and approximately 4 cm in height [16,19]. Due to its high invasiveness in localities beyond its natural range, D. setosum is now widely distributed in tropical regions throughout the Indo-Pacific basin and can be found latitudinally from Japan to Africa and longitudinally from the Red Sea to Australia [20]. D. setosum can thrive at depths of up to 70 m below sea level and is usually reef-associated [21]. It is a prolific grazer that feeds on the macroalgae found on the surface of various substrata, as well as the algae associated with the coral skeleton [22,23]. While a normal level of grazing eliminates competitive algae and can potentially offer a more suitable environment for coral settlement and development, overgrazing reduces the complexity of the coral community, which in turn deteriorates the reef ecosystem [24,25]. Furthermore, overpopulated sea urchins can reduce coral recruitment and hinder the growth of juvenile corals [26–28].
Here, we present the chromosomal-level genome assembly of D. setosum. This valuable resource provides insights into the ecology and evolution of echinoderms, thereby enhancing further studies on sea urchins.

CONTEXT
Here, we report a high-quality genome assembly of D. setosum in the order Diadematoida and family Diadematidae.

Collection and husbandry of samples
The long-spined sea urchins, D. setosum, were collected at the coastal area of the Tolo Channel in Hong Kong (22.4872, 114.3082) in November 2022. The animals were maintained in 35 ppt artificial seawater at 23 °C until DNA and RNA isolation, and were fed with frozen clams or shrimps once a week.

Isolation of high molecular weight genomic DNA, quantification, and qualification
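High molecular weight (HMW) genomic DNA was isolated from a single individual. The urchin was first removed from the culture and the test was opened with a pair of scissors. The internal tissue, except the gut, was snap-frozen in liquid nitrogen and ground to fine powder. DNA extraction was performed with the Qiagen MagAttract HMW kit (Qiagen Cat. No. 67563) following the manufacturer's protocol. In brief, 1 g of powdered sample was put in a microcentrifuge tube with 200 μl 1× PBS. Subsequently, RNase A, Proteinase K, and Buffer AL were added to the tube. The mixture was incubated at room temperature (∼22 °C) for 3 hours. The sample was then eluted with 120 μl of elution buffer (PacBio Ref. No. 101-633-500). Throughout the extraction process, wide-bore tips were used whenever DNA was transferred. The eluted sample was quantified with the Qubit® Fluorometer and the Qubit™ dsDNA HS and BR Assay Kits (Invitrogen™ Cat. No. Q32851). In total, 10 μg of DNA was collected.
The purity of the sample was evaluated with the NanoDrop™ One/OneC Microvolume UV-Vis Spectrophotometer, with the standard A260/A280: ∼1.8 and A260/A230: >2.0. The quality and the fragment distribution of the isolated genomic DNA were examined by overnight pulse-field gel electrophoresis, together with three DNA markers (λ-Hind III digest, Takara Cat. No. 3403; DL15,000 DNA Marker, Takara Cat. No. 3582A; and CHEF DNA Size Standard-8-48 kb Ladder, Cat. No. 170-3707). The DNA was then diluted in elution buffer to prepare a 300 ng solution for gel electrophoresis. The electrophoresis profile was set as follows: 5k as the lower end and 100k as the higher end for the molecular weight; Gradient = 6.0 V/cm; Run time = 15 h:16 min; Included angle = 120°; Int. Sw. Tm = 22 s; Fin. Sw. Tm = 0.53 s; Ramping factor: a = Linear. The gel was run in 1.0% PFC agarose in 0.5× TBE buffer at 14 °C.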

DNA shearing, library preparation, and sequencing
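A total of 10 μg of D. setosum DNA in 120 μl elution buffer was transferred to a g-tube (Covaris Part No. 520079). The sample was then centrifuged six times at 2,000 × g for 2 min. The sheared DNA was collected in a 2 ml DNA LoBind® Tube (Eppendorf Cat. No. 022431048) and kept at 4 °C until the library preparation. Overnight pulse-field gel electrophoresis was performed to examine the fragment distribution of the sheared DNA, with the same electrophoresis profile described in the previous section.
A SMRTbell library was then constructed with the SMRTbell® prep kit 3.0 (PacBio Ref. No. 102-141-700) following the manufacturer's protocol. In brief, the sheared DNA was first subjected to DNA repair, and both ends of each DNA strand were polished and tailed with an A-overhang. Ligation of T-overhang SMRTbell adapters was then performed and the SMRTbell library was purified with SMRTbell® cleanup beads (PacBio Ref. No. 102-158-300). The concentration and size of the library were examined with the Qubit® Fluorometer, [...] SMRTbell structures in the library, and a final size-selection step was performed to remove the short fragments in the library with 35% AMPure PB beads.
The Sequel® II binding kit 3.2 (PacBio Ref. No. 102-194-100) was used for the final preparation of sequencing. In brief, Sequel II primer 3.2 and Sequel II DNA polymerase 2.2 were annealed and bound to the SMRTbell structures in the library. Then, the library was loaded at an on-plate concentration of 90 pM using the diffusion loading mode. The sequencing was performed on the Sequel IIe System with the internal control provided by the binding kit. The sequencing was prepared and run in 30-hour movies, with 120 min pre-extension. The movie was captured by the software SMRT Link v11.0 (PacBio) and HiFi reads were generated and collected for further analysis. In total, one SMRT cell was used in the sequencing. Details of the sequencing data are listed in Table 1.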

Omni-C library preparation and sequencing
An Omni-C library was constructed using the Dovetail® Omni-C® Library Preparation Kit (Dovetail Cat. No. 21005) according to the manufacturer's instructions. In brief, 60 mg of frozen powdered tissue sample was added into 1 mL 1× PBS, where the genomic DNA was crosslinked with formaldehyde, and the DNA was then digested with the endonuclease DNase I. Subsequently, the concentration and fragment size of the digested sample were validated with the Qubit® Fluorometer (Qubit™ dsDNA HS and BR Assay Kits, Invitrogen™ Cat. No. Q32851) and the TapeStation D5000 HS ScreenTape, respectively. Afterwards, both ends of the DNA were polished and a biotinylated bridge adaptor was ligated at 22 °C for 30 min. Next, proximity ligation between crosslinked DNA fragments was performed at 22 °C for 1 hour, followed by the reverse crosslinking of DNA and its purification with SPRIselect™ Beads (Beckman Coulter Product No. B23317).
End repair and adapter ligation were performed with the Dovetail™ Library Module for Illumina (Dovetail Cat. No. 21004). In brief, DNA was tailed with an A-overhang and ligated with Illumina-compatible adapters at 20 °C for 15 min. The Omni-C library was then sheared into small fragments with USER Enzyme Mix and purified with SPRIselect™ Beads. Subsequently, DNA fragments were isolated with Streptavidin Beads. Universal and Index PCR Primers from the Dovetail™ Primer Set for Illumina (Dovetail Cat. No. 25005) were used to amplify the DNA library. A final size-selection step was completed with SPRIselect™ Beads to retain only DNA fragments ranging between 350 bp and 1,000 bp. The concentration and fragment size of the sequencing library were assessed with the Qubit® Fluorometer (Qubit™ dsDNA HS and BR Assay Kits) and the TapeStation D5000 HS ScreenTape, respectively. The qualified library was sequenced on an Illumina HiSeq-PE150 platform. An Agilent 2100 Bioanalyser (Agilent DNA 1000 Reagents) was used to measure the insert size and concentration of the final library. Details of the sequencing data are listed in Table 1.

Genome assembly and gene model prediction
De novo genome assembly was completed using Hifiasm (RRID:SCR_021069) [29] with default parameters. The Hifiasm output assembly was searched against the NT database using BLAST (RRID:SCR_004870), and the BLAST output was used as input for BlobTools (v1.1.1, RRID:SCR_017618) [30] to identify and remove any possible contamination (Figure 2).
Haplotypic duplications of the primary assembly were detected and removed using purge_dups (RRID:SCR_021173) [31] with default parameters, according to the depth of HiFi reads. Proximity ligation data from the Omni-C library were used to scaffold the PacBio genome with YaHS [32]. A k-mer-based statistical analysis was used to estimate the heterozygosity, while the repeat content and size were analyzed by Jellyfish (RRID:SCR_005491) [33] and GenomeScope (RRID:SCR_017014) [34]. Transposable elements (TEs) were annotated using the automated Earl Grey TE annotation pipeline (version 1.2) [35]. The mitochondrial genome was assembled using MitoHiFi (v2.2) [36].
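The k-mer-based analysis mentioned above starts from a k-mer count histogram of the reads. The following is a minimal Python sketch of that idea only; it is not the Jellyfish or GenomeScope implementation, and the function names, the toy reads, and the choice of k = 4 are illustrative assumptions rather than parameters reported in this study.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(reads, k):
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_spectrum(counts):
    """Histogram mapping multiplicity -> number of distinct k-mers seen that often."""
    return Counter(counts.values())


# Hypothetical toy reads; real analyses use the full read set and a much larger k.
toy_reads = ["ACGTACGT", "CGTACGTA"]
counts = kmer_counts(toy_reads, k=4)
spectrum = kmer_spectrum(counts)
```

In a real data set, the position and shape of the peaks in such a spectrum are what model-fitting tools use to estimate genome size, heterozygosity, and repeat content.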

DATA VALIDATION AND QUALITY CONTROL
Quality checks of samples during DNA extraction and PacBio library preparation were performed with the NanoDrop™ One/OneC Microvolume UV-Vis Spectrophotometer, the Qubit® Fluorometer, and overnight pulse-field gel electrophoresis. The Omni-C library was quality-checked with the Qubit® Fluorometer and the TapeStation D5000 HS ScreenTape.
For the genome assembly, potential contaminant scaffolds in the Hifiasm output were identified by searching the NT database with BLAST. The resulting output was analysed by BlobTools (v1.1.1) [30] (Figure 2). Furthermore, a k-mer-based statistical approach was used to estimate the genome heterozygosity. The repeat content and their size were estimated by Jellyfish [33] and GenomeScope (Figure 1E and Table 2) [34]. BUSCO (v5.5.0) [40] was run to evaluate the completeness of the genome assembly and gene annotation with the metazoan dataset (metazoa_odb10).
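As a quick consistency check on the BUSCO figures reported in this article, the complete score should equal the sum of the single-copy and duplicated scores, with the remainder made up of fragmented and missing genes. The short Python sketch below only illustrates that arithmetic; the function names are hypothetical and are not part of BUSCO.

```python
def busco_complete(single_pct: float, duplicated_pct: float) -> float:
    """Complete BUSCO percentage = single-copy + duplicated, rounded to one decimal."""
    return round(single_pct + duplicated_pct, 1)


def busco_fragmented_or_missing(complete_pct: float) -> float:
    """Percentage of BUSCO genes that are either fragmented or missing."""
    return round(100.0 - complete_pct, 1)


# Values reported for the D. setosum assembly (metazoa_odb10).
single, duplicated, complete_reported = 97.8, 0.3, 98.1

complete = busco_complete(single, duplicated)
remainder = busco_fragmented_or_missing(complete)
```

With the reported values, the single-copy and duplicated scores add up to the stated 98.1%, leaving 1.9% of the metazoan BUSCO genes fragmented or missing.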

RESULTS
A total of 18.5 Gb of HiFi bases were generated, with an average HiFi read length of 8,449 bp and 21× data coverage (Table 1). The assembled genome size was 885.8 Mb, with 101 scaffolds, a scaffold N50 of 38.3 Mb in 11 scaffolds, a contig N50 of 3.5 Mb in 84 contigs, and a complete BUSCO estimation of 98.1% (metazoa_odb10; the single-copy score was 97.8% and the duplication score was 0.3%) (Figure 1B; Table 3). By incorporating 67.5 Gb of Omni-C data, the assembly anchored 98.6% of the scaffolds into 22 pseudochromosomes, which matches the karyotype of D. setosum (2n = 44) [45] (Figure 1C; Table 4). The assembled D. setosum genome size is comparable to other published sea urchin genomes [8–11] and to the size of 804 Mb estimated by GenomeScope, with a 2.11% heterozygosity rate (Figure 1D; Table 2). Moreover, telomeric repeats were identified in 16 out of 22 pseudochromosomes (Table 5).
Total RNA sequencing data were obtained from a single D. setosum individual. The final assembled transcriptome contained 135,063 transcripts, with 113,391 Trinity-annotated genes (with an average length of 838 bp and an N50 length of 1,456 bp), and was used to perform gene model prediction. A total of 27,478 gene models were generated with 23,030 predicted protein-coding genes, with a mean coding sequence length of 483 amino acids (Figure 1B; Table 3).
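For readers less familiar with the assembly metrics quoted above, the following minimal Python sketch shows how an N50 (and the number of sequences needed to reach it) is computed, and how the reported coverage follows from the total HiFi bases and the assembly size. The list of toy sequence lengths is hypothetical; only the 18.5 Gb, 885.8 Mb, and 2n = 44 figures come from this article.

```python
def n50(lengths):
    """Return (N50, L50): the length at which the cumulative sum of sequences,
    sorted from longest to shortest, first reaches half of the total length,
    and the number of sequences needed to reach it."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count


def coverage(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth = total sequenced bases / genome size."""
    return total_bases / genome_size


# Hypothetical toy scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [50, 40, 30, 20, 10]
toy_n50, toy_l50 = n50(toy_lengths)

# Figures reported in this article: 18.5 Gb of HiFi bases, 885.8 Mb assembly.
depth = coverage(18.5e9, 885.8e6)

# Haploid chromosome number implied by the cited karyotype (2n = 44).
haploid_number = 44 // 2
```

Read this way, "a scaffold N50 of 38.3 Mb in 11 scaffolds" means that the 11 longest scaffolds together contain at least half of the assembled bases, and the shortest of those 11 is 38.3 Mb long.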

CONCLUSION AND FUTURE PERSPECTIVES
The sea urchin D. setosum (Diadematoida) belongs to a key phylogenetic group of animals in evolutionary history. This animal is characterised by deuterostomic development and is ecologically important to coral reefs. Prior to this study, only a limited number of high-quality sea urchin genomes were available, and a genomic resource for this ecologically important member of the Diadematoida was missing. Here, we present a high-quality chromosomal-level genome assembly of D. setosum, providing a valuable resource and foundation for a better understanding of the ecology and evolution of sea urchins.

Transcriptome sequencing
Total RNA was extracted from the internal tissues of the same individual used for DNA extraction using TRIzol reagent (Invitrogen) following the manufacturer's protocol. The quality of the extracted RNA was validated with the NanoDrop™ One/OneC Microvolume UV-Vis Spectrophotometer (Thermo Scientific™ Cat. No. ND-ONE-W) and 1% agarose gel electrophoresis. The qualified samples were sent to Novogene Co. Ltd (Hong Kong, China) for the construction of a polyA-selected RNA sequencing library using the TruSeq RNA Sample Prep Kit v2 (Illumina Cat. No. RS-122-2001) and 150 bp paired-end sequencing.

Table 1. Genome and transcriptome sequencing information.

Table 3. Genome assembly statistics and sequencing information.

Table 5. List of the telomeric repeats identified in the genome.

Table 6. Statistics of the annotated repetitive elements.